#Using R markdown

##Working with R Markdown

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

plot(cars)

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

Installing the robis package

First install the robis package. If the package is already installed, make sure the version is 2.10.0 or higher.

if (!"robis" %in% rownames(installed.packages())) {
  remotes::install_github("iobis/robis")
} else if (packageVersion("robis") < "2.10.0") {
  knitr::knit_exit()
}

We will also need the following packages:

library(robis)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(knitr)
library(ggtree)
## ggtree v3.13.0 For help: https://yulab-smu.top/treedata-book/
## 
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
## 
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam.
## ggtree: an R package for visualization and annotation of phylogenetic
## trees with their covariates and other associated data. Methods in
## Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
## 
## S Xu, Z Dai, P Guo, X Fu, S Liu, L Zhou, W Tang, T Feng, M Chen, L
## Zhan, T Wu, E Hu, Y Jiang, X Bo, G Yu. ggtreeExtra: Compact
## visualization of richly annotated phylogenetic data. Molecular Biology
## and Evolution. 2021, 38(9):4039-4042. doi: 10.1093/molbev/msab166
## 
## Guangchuang Yu. Using ggtree to visualize data on tree-like structures.
## Current Protocols in Bioinformatics. 2020, 69:e96. doi:10.1002/cpbi.96
library(mapview)
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(RColorBrewer)
library(rmarkdown)

eDNA data access in OBIS

In this notebook we will explore data published to OBIS using the new DNADerivedData extension. This notebook is available at https://iobis.github.io/notebook-dnaderiveddata/

Finding sequence based data

Before we fetch any occurrence data let’s first find out which datasets in OBIS are using the DNADerivedData extension. The dataset() function takes many of the same parameters as the occurrence() function, including hasextensions which allows us to limit our search to occurrences which have linked extension records of a specific type. Possible values include MeasurementOrFact and DNADerivedData.

dna_datasets <- dataset(hasextensions = "DNADerivedData")


paged_table(dna_datasets %>%
  select(url, title, records, node_name, extent, id))
#write.table(dna_df, "dna_datasets_obis.txt", row.names = F, sep="\t")

#Alternatively the list is provided in the following:
dna_datasets <- read.csv("./solutions/dna_datasets_obis.txt", header=T, sep="\t")
dna_datasets_geo<-dna_datasets$extent[!is.na(dna_datasets$extent)]

edna_polygons <- st_as_sfc(dna_datasets_geo, crs = 4326)

mapview(edna_polygons)

#In case of error try setting: #mapviewOptions(fgb = FALSE)

Fetching sequence based data

Let’s fetch all occurrences from the PacMAN dataset by using the datasetid parameter. Alternatively, set hasextensions to DNADerivedData. Also set the extensions parameter to ensure that the nested extension records are included in the results.

pacmanID<-dna_datasets %>% filter(grepl("PacMAN", title))%>%
select(id)

occ <- occurrence(datasetid = pacmanID, extensions = "DNADerivedData", wrims = T)

#or in case no access to internet:
#occ <- read.csv2("./solutions/pacman_wrims_occurrences.txt", header=T, sep="\t")

Let’s take a look at the taxonomic composition of our dataset:

occ %>%
  group_by(phylum) %>%
  summarize(species = length(unique(scientificNameID)), records = n()) %>%
  arrange(desc(species))

Extracting DNADerivedData records

The DNADerivedData records are currently nested within the occurrences we fetched from OBIS. To extract the extension records we can use the unnest_extension() function. Pass any fields from the occurrence table you would like to keep using the fields parameter.

dna <- unnest_extension(occ, "DNADerivedData", fields = c("id", "phylum", "class", "order", "family", "genus", "species", "date_year"))

#Or in case no access to internet
#dna <- read.csv2("./solutions/pacman_wrims_dna.txt", header=T, sep="\t")

Let’s take a look at some properties of the sequence records.

dna %>%
  group_by(target_gene, pcr_primer_name_forward, pcr_primer_name_reverse) %>%
  summarize(records = n()) 

Working with DNADerivedData records

This is a quick demo of working with sequences from the DNADerivedData extension. For the remainder of the notebook I will work with a subset of the sequence data.

dna_subset <- dna %>%
  filter(phylum == "Chordata" & !is.na(species)) %>%
  head(100)

To be able to make use of existing bioinformatics packages we first need to convert our DNA_sequence column to a DNAbin object. This can be done using as.DNAbin function from the ape (Analyses of Phylogenetics and Evolution) package.

dna_subset_COI <- dna_subset %>% filter(target_gene =="COI")

sequences <- sapply(strsplit(as.character(dna_subset_COI$DNA_sequence), ""), tolower)
names(sequences) <- dna_subset_COI$id
sequences <- ape::as.DNAbin(sequences)
sequences
## 34 DNA sequences in binary format stored in a list.
## 
## Mean sequence length: 303.412 
##    Shortest sequence: 150 
##     Longest sequence: 313 
## 
## Labels:
## 01d01916-271a-4abc-bdef-40972f769b9a
## 02ed7f53-7fde-42e8-8d91-6c4511d655dc
## 042f1a54-cf5d-4dc8-8001-da1f2983db4e
## 05dc4c32-80c9-4e0b-aa08-eec029d0931a
## 0642175e-6ff5-4934-ad4a-0a25dad82a6b
## 08641ad5-0866-4ffb-9a80-e678c01f5bed
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.223 0.172 0.179 0.426 
## (Total: 10.32 kb)

Now we can align the sequences using MAFFT. The ips (Interfaces to Phylogenetic Software in R) package provides an interface to the MAFFT software.

aligned <- ips::mafft(sequences, method = "auto")
aligned
## 34 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 313 
## 
## Labels:
## 01d01916-271a-4abc-bdef-40972f769b9a
## 02ed7f53-7fde-42e8-8d91-6c4511d655dc
## 042f1a54-cf5d-4dc8-8001-da1f2983db4e
## 05dc4c32-80c9-4e0b-aa08-eec029d0931a
## 0642175e-6ff5-4934-ad4a-0a25dad82a6b
## 08641ad5-0866-4ffb-9a80-e678c01f5bed
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.223 0.172 0.179 0.426 
## (Total: 10.64 kb)

Let’s visualize the aligned sequences:

ape::image.DNAbin(aligned, col=c(RColorBrewer::brewer.pal(6, "Pastel1")), show.bases=T, show.labels=F, base.cex=0.4)

Finally, we can calculate a distance matrix and construct a tree using dist.dna and njs.

distances <- ape::dist.gene(aligned)

tree <- ape::njs(distances) %>%
  left_join(dna_subset %>% select(id, year = date_year, species), by = c("label" = "id"))

ggtree(tree, layout = "circular", cex = 0.3) +
  geom_tiplab(size = 2, aes(label = species, color = year)) +
  scale_color_viridis_c(option = "magma", end = 0.9)

Using the mapper

DNADerivedData extension records can also be included in data downloads from the mapper. To do so, tick the DNADerivedData checkbox in the download confirmation form.

mapper

The resulting archive will include CSV files for the occurrences as well as for the selected extensions. The occurrence and extension tables can be joined using the id column.

mapper

NEW Prototype: sequence search in OBIS

Having sequences in OBIS will allow global comparisons. A prototype sequence search tool has been developed by Pieter Provoost, that you can now try.

Note this is a fresh prototype!

AGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCGGAGCGGGCCAGTTGGTCGGCCGCAAGGTCTGTTACTGACTGGTCTGCTCTTCTTCGCAAAGACTGCGTGTGCTCTTGACTGAGTGTGCGTAGGATTTACGACGTTTACTTTGAAAAAATTAGAGTGTTCAAAGCAGGC

ACTTTGAAAAAATTAGAGTGCTCAAAGCAGGCTAAAATGCCTGAATATTCGTGCATGGAATAATAGAATAGGATGTCGATCCTATTTTGTTGGTTTTCGGGACTCGACATAATGATAAATAGGGACAGTCGGGGGCATTTGTATTCAAACGACAGAGGTGAAATTCTTGGACCGTTTGAAGACAAACTACTGCGAAAGCATTTG

mapper